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^vq ■ I would like to begin by congratulating the authors (referred to below 

as EHJT) for their interesting paper in which they propose a new variable 
Li^ ' selection method (LARS) for building linear models and show how their new 

^^ ■ method relates to other methods that have been proposed recently. I found 

the paper to be very stimulating and found the additional insight that it 
provides about the Lasso technique to be of particular interest. 

My comments center around the question of how we can select linear 
models that conform with the marginality principle [Nelder (1977, 1994) 
and McCullagh and Nelder (1989)]; that is, the response surface is invariant 
under scaling and translation of the explanatory variables in the model. 
^S| I Recently one of my interests was to explore whether the Lasso technique 

C^^ ■ or the nonnegative garrote [Breiman (1995)] could be modified such that it 

incorporates the marginality principle. However, it does not seem to be a 
(•~^ I trivial matter to change the criteria that these techniques minimize in such a 

^" ■ way that the marginality principle is incorporated in a satisfactory manner. 

On the other hand, it seems to be straightforward to modify the LARS 

1-^ ' technique to incorporate this principle. In their paper, EHJT address this 

C^ • issue somewhat in passing when they suggest toward the end of Section 3 

H ! that one first fit main effects only and interactions in a second step to control 

l^ ' the order in which variables are allowed to enter the model. However, such 

•rH . a two-step procedure may have a somewhat less than optimal behavior as 

r> I the following, admittedly artificial, example shows. 

C^ ■ Assume we have a vector of explanatory variables X = (Xi, ^2) • • • i ^lo) 

where the components are independent of each other and Xi, i = 1, . . . , 10, 
follows a uniform distribution on [0, 1] . Take as model 

(1) Y = {Xi-0.5f+X2 + X3 + X^ + X5 + e, 

where e has mean zero, has standard deviation 0.05 and is independent oi X . 
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FlG. 1. LARS analysis of simulated data with main terms only: (left) estimates of re- 
gression coefficients I3j, j = l,...,10, plotted versus ^|/3j|; (right) absolute current cor- 
relations as functions of LARS step. 



It is not difficult to verify that in this model Xi and Y are uncorrelated. 
Moreover, since the Xj's are independent, Xi is also uncorrelated with any 
residual vector coming from a linear model formed only by explanatory 
variables selected from {X2, . ■ . , Xio}. 

Thus, if one fits a main effects only model, one would expect that the 
LARS algorithm has problems identifying that Xi should be part of the 
model. That this is indeed the case is shown in Figure 1. The picture presents 
the result of the LARS analysis for a typical data set generated from model 
(1); the sample size was n = 500. Note that, unlike Figure 3 in EHJT, Fig- 
ure 1 (and similar figures below) has been produced using the standardized 
explanatory variables and no back-transformation to the original scale was 
done. 

For this realization, the variables are selected in the sequence X2, X5, 
X4, X3, Xq, XiQ, Xy, Xg, Xq and, finally, Xi. Thus, as expected, the LARS 
algorithm has problems identifying Xi as part of the model. To further 
verify this, 1000 different data sets, each of size n = 500, were simulated 
from model (1) and a LARS analysis performed on each of them. For each 
of the 10 explanatory variables the step at which it was selected to enter the 
model was recorded. Figure 2 shows for each of the variables the (percentage) 
histogram of these data. 

It is clear that the LARS algorithm has no problems identifying that 
X2,- ■ ■ jX^ should be in the model. These variables are all selected in the 
first four steps and, not surprisingly given the model, with more or less equal 
probability at any of these steps. Xi has a chance of approximately 25% of 
being selected as the fifth variable, otherwise it may enter the model at step 
6, 7, 8, 9 or 10 (each with probability roughly 15%). Finally, each of the 
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variables Xg to Xiq seems to be selected with equal probability anywhere 
between step 5 and step 10. 

This example shows that a main effects first LARS analysis followed by a 
check for interaction terms would not work in such cases. In most cases the 
LARS analysis would miss Xi as fifth variable and even in the cases where 
it was selected at step 5 it would probably be deemed to be unimportant 
and excluded from further analysis. 

How does LARS perform if one uses from the beginning all 10 main effects 
and all 55 interaction terms? Figure 3 shows the LARS analysis for the 
same data used to produce Figure 1 but this time the design matrix was 
augmented to contain all main effects and all interactions. The order in 
which the variables enter the model is X2 ■. 5 = X2 x X^ , X2 ■ 4 , X^ ■ 4 , X2 ■. 3 , 

^3:5) Xi-.bi X5:5 = X^, X4, X3, X2, X5, ^4:4, Xi; 1, Xi-q, Xi:9, Xi, .... 

In this example the last of the six terms that are actually in model (1) was 
selected by the LARS algorithm in step 16. 

Using the same 1000 samples of size n = 500 as above and performing a 
LARS analysis on them using a design matrix with all main and interaction 
terms shows a surprising result. Again, for each replication the step at which 
a variable was selected into the model by LARS was recorded and Figure 4 
shows for each variable histograms for these data. To avoid cluttering, the 
histograms in Figure 4 were truncated to [1,20]; the complete histograms 
are shown on the left in Figure 7. 
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Fig. 2. Percentage histogram of step at which each variable is selected based on 1000 
replications: results shown for LARS analysis using main terms only. 
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Fig. 3. LARS analysis of simulated data with main terms and interaction terms: (left) 
estimates of regression coefficients I3j, j = 1,. . . ,65, plotted versus ^ |/3j|; (right) absolute 
current correlations as functions of LARS step. 



The most striking feature of these histograms is that the six interaction 
terms Xi^j, i,j G {2, 3, 4, 5}, i < j, were always selected first. In no replication 
was any of these terms selected after step 6 and no other variable was ever 
selected in the first six steps. That one of these terms is selected as the first 
term is not surprising as these variables have the highest correlation with the 
response variable Y. It can be shown that the covariance of these interaction 
terms with y is by a factor ^^12/7 ~ 1.3 larger than the covariance between 
Xi and Y for i = 2, . . . , 5. But that these six interaction terms dominate the 
early variable selection steps in such a manner came as a bit as a surprise. 

After selecting these six interaction terms, the LARS algorithm then 
seems to select mostly X2 , X3 , X4 and X5 , followed soon by Xi ; 1 and 
Xi. However, especially the latter one seems to be selected rather late and 
other terms may be selected earlier. Other remarkable features in Figure 4 
are the peaks in histograms oi Xi-i for i = 2, 3, 4, 5; each of these terms seems 
to have a fair chance of being selected before the corresponding main term 
and before Xi: 1 and Xi. 

One of the problems seems to be the large number of interaction terms 
that the LARS algorithm selects without putting the corresponding main 
terms into the model too. This behavior violates the marginality principle. 
Also, for this model, one would expect that ensuring that for each higher- 
order term the corresponding lower-order terms are in the model too would 
alleviate the problem that the six interaction terms Xi-j, i,j G {2,3,4,5}, 
i<j, are always selected first. 

I give an alternative description of the LARS algorithm first before I 
show how it can be modified to incorporate the marginality principle. This 
description is based on the discussion in EHJT and shown in Algorithm 1. 
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Fig. 4. Percentage histogram of step at which each variable is selected based on 1000 
replications: results shown for variables selected in the first 20 steps of a LARS analysis 
using main and interaction terms. 



Algorithm 1 (An alternative description of the LARS algorithm). 

1. Set fJ-Q = and A; = 0. 

2. repeat 

3. Calculate c = X' [y — jl^) and set C = niaxj{|cj|}. 

4. hei A = {j:\cj\=C}. 

5. Set X_A = (• ■ • Xj • • ■)j<^A for calculating yfe+i = {X'j^XX}~^X'jj^y and a : 

6. Set 



Afc+i = Afc + 7(yfc+i-Afc), 



where, if ^'^ / 0, 



7 = mm 



(_y Oj O ~r Cj 



j&A'^ [C-Qj C + Qj 



otherwise set 7 = 1. 

7. k^k + l. 

8. until A'' = 0. 



6 DISCUSSION 

We start with an estimated response fi^ = and then iterate until all 
variables have been selected. In each iteration, we first determine (up to a 
constant factor) the correlation between all variables and the current resid- 
ual vector. All variables whose absolute correlation with the residual vector 
equals the maximal achievable absolute correlation are chosen to be in the 
model and we calculate the least squares regression response, say y^+i, us- 
ing these variables. Then we move from our current estimated response fij^ 
toward y^+i until a new variable would enter the model. 

Using this description of the LARS algorithm, it seems obvious how to 
modify the algorithm such that it respects the marginality principle. Assume 
that for each column i of the design matrix we set dij = 1 if column j should 
be in the model whenever column i is in the model and zero otherwise; here 
j ^ i takes values in {1, . . . ,m}, where m is the number of columns of the 
design matrix. For example, abusing this notation slightly, for model (1) we 
might set di : i,i = 1 and all other di ■ ij = 0; or di : 2,1 = !> ^i : 2,2 = 1 and all 
other di ^ 2,j equal to zero. 

Having defined such a dependency structure between the columns of the 
design matrix, the obvious modification of the LARS algorithm is that when 
adding, say, column i to the selected columns one also adds all those columns 
for which dij = 1. This modification is described in Algorithm 2. 

Algorithm 2 (The modified LARS algorithm). 

1. Set /jIq = and k = 0. 

2. repeat 

3. Calculate c = X' {y — fi^) and set C = maxj{|cj|}. 

4. Let Ao = {3 ■■ \cj I = C}, Ai = {j : dij / 0, i G Ao} and A = AoUAi. 

5. Set Xj, = (• • • Xj • • •)jG^ for calculating y^+i = {X'_^Xj^)~^ X'_^y and a = 

^^(yfc+i-Afc)- 

6. Set 

where, if ^'^ / 0, 

_l_ I O Cj O ~r Cj 

7 = mm < — , — 

jeA- [c-aj C + aj 

otherwise set 7 = 1. 

7. k^k + 1. 

8. until A'' = 0. 

Note that compared with the original Algorithm 1 only the fourth line 
changes. Furthermore, for all i € .4 it is obvious that for < 7 < 1 we have 

(2) |c,(7)| = (1-7)|q|, 
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Fig. 5. Modified LARS analysis of simulated data with main terms and interaction terms: 
(left) estimates of regression coefficients J3j, j = l,...,65, plotted versus 'Y^\l3j\; (right) 
absolute current correlations as functions of k = if^A'^ . 



where 0(7) = X'{y - jl^j)) and /i(7) = Afc + 7(yfe+i - Afc)- 

Note that, by definition, the value of \cj \ is the same for all j S Aq. Hence, 

the functions (2) for those variables are identical, namely (1 — j)C, and for 

all j € Ai the corresponding functions |cj(7)| will intersect (1 — 7) (7 at 7=1. 

This explains why in line 6 we only have to check for the first intersection 

between (1 — j)C and |cj(7)| for j € A'^. 

It also follows from (2) that, for all j € ^O) we have 

^'jifk+i - P'k) = sign{cj)C. 

Thus, for those variables that are in ^0 we move in line 6 of the modified 
algorithm in a direction that has a similar geometric interpretation as the 
direction along which we move in the original LARS algorithm. Namely 
that for each j E Aq the angle between the direction in which we move and 
sign(cj)xj is the same and this angle is less than 90°. 

Figure 5 shows the result of the modified LARS analysis for the same 
data used above. Putting variables that enter the model simultaneously into 
brackets, the order in which the variables enter the model is {X2: 5, X2, X^), 
(^3:4, X3, X4,), X2:5, X2:3, {Xi; i, Xi), . . . . That is, the modified LARS 
algorithm selects in this case in five steps a model with 10 terms, 6 of which 
are the terms that are indeed in model (1). 

Using the same 1000 samples of size n = 500 as above and performing a 
modified LARS analysis on them using a design matrix with all main and 
interaction terms also shows markedly improved results. Again, for each 
replication the step at which a variable was selected into the model was 
recorded and Figure 6 shows for each variable histograms for these data. To 
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Fig. 6. Percentage histogram of step at which each variable is selected based on 1000 
replications: results shown for variables selected in the first 20 steps of a modified LARS 
analysis using main and interaction terms. 



avoid cluttering, the histograms in Figure 6 were truncated to [1,20]; the 
complete histograms are shown on the right in Figure 7. 

From Figure 6 it can be seen that now the variables X2, X^, X4 and 
X5 are all selected within the first three iterations of the modified LARS 
algorithm. Also Xi : 1 and Xi are picked up consistently and early. Compared 
with Figure 4 there are marked differences in the distribution of when the 
variable is selected for the interaction terms Xi-j, i,j G {2, 3, 4, 5}, i < j, and 
the main terms Xi, i = 6, . . . , 10. The latter can be explained by the fact 
that the algorithm now enforces the marginality principle. Thus, it seems 
that this modification does improve the performance of the LARS algorithm 
for model (1). Hopefully it would do so also for other models. 

In conclusion, I offer two further remarks and a question. First, note that 
the modified LARS algorithm may also be used to incorporate factor vari- 
ables with more than two levels. In such a situation, I would suggest that 
indicator variables for all levels are included in the initial design matrix; but 
this would be done mainly to easily calculate all the correlations. The de- 
pendencies dij would be set up such that if one indicator variable is selected, 
then all enter the model. However, to avoid redundancies one would only 
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Fig. 7. Percentage histogram of step at which each variable is selected based on 1000 
replications: (left) LARS analysis; (right) modified LARS analysis. 



put all but one of these columns into the matrix Xj\^. This would also avoid 
that XjX would eventually become singular if more than one explanatory 
variable is a factor variable. 

Second, given the insight between the LARS algorithm and the Lasso al- 
gorithm described by EHJT, namely the sign constraint (3.1), it now seems 
also possible to modify the Lasso algorithm to incorporate the marginality 
principle by incorporating the sign constraint into Algorithm 2. However, 
whenever a variable would be dropped from the set Aq due to violating the 
sign constraint there might also be variables dropped from ^i. For the lat- 
ter variables these might introduce discontinuities in the traces of the corre- 
sponding parameter estimates, a feature that does not seem to be desirable. 
Perhaps a better modification of the Lasso algorithm that incorporates the 
marginality principle can still be found? 

Finally, the behavior of the LARS algorithm for model (1) when all main 
terms and interaction terms are used surprised me a bit. This behavior seems 
to raise a fundamental question, namely, although we try to build a linear 
model and, as we teach our students, correlation "measures the direction 
and strength of the linear relationship between two quantitative variables" 
[Moore and McCabe (1999)], one has to wonder whether selecting variables 
using correlation as a criterion is a sound principle? Or should we modify 
the algorithms to use another criterion? 
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